Before I launch into the questions I need to prepare the data a little to make sure that I have a suitable base to conduct my analysis. This is a very large data set with over 15 million observations, but it looks to be just a single month of data, from April 2013.
trip_data_raw <- read_csv("trip_data_4.csv")
trip_fare_raw <- read_csv("trip_fare_4.csv")
names(trip_data_raw)
names(trip_fare_raw)
The data comes in two files but can be matched by joining on medallion, hack_license, vendor_id and pickup_datetime. Joining introduces a negligible increase of about 1,300 duplicate observations, mostly resulting from incorrect location data and payment types. I expect these to be stripped out when I clean the variables next.
data <- left_join(trip_data_raw, trip_fare_raw, by = c("medallion", "hack_license", "vendor_id", "pickup_datetime"))
There is a bit of fairly obvious filtering which should be done to remove observations which have:

- zero passengers
- trip times of 30 seconds or less
- trip distances of 0.1 miles or less
- pickup or dropoff coordinates outside the New York area
Furthermore, I can produce a few potentially useful features like speed and day of week. I am going to assume the trip distance variable is in miles and therefore calculate speed in miles per hour. I am also assuming that average speed should be below 70 miles per hour.
cleaning_1 <- data %>%
  filter(passenger_count != 0) %>%
  filter(trip_time_in_secs > 30) %>%
  filter(trip_distance > 0.1) %>%
  filter(pickup_longitude > -75 & pickup_longitude < -73) %>%
  filter(dropoff_longitude > -75 & dropoff_longitude < -73) %>%
  filter(pickup_latitude > 40 & pickup_latitude < 42) %>%
  filter(dropoff_latitude > 40 & dropoff_latitude < 42)

cleaning_2 <- cleaning_1 %>%
  mutate(avg_speed_mh = trip_distance / trip_time_in_secs * 60 * 60) %>%
  filter(avg_speed_mh < 70)

cleaning_3 <- cleaning_2 %>%
  mutate(pickup_hour = floor_date(pickup_datetime, "hour")) %>%
  mutate(day_of_week = wday(pickup_hour, label = TRUE))
This removes about 500,000 observations leaving well over 14 million.
Even though I am working from a Google Cloud Platform instance, the size of the data is inhibiting my ability to move quickly. One solution to this, which will also be useful down the line when modelling, is to split the data into train, validation and test sets. Not only does splitting before any analysis maintain the integrity of the test set, it also lets me work with a smaller, more manageable dataset.
Given this, I’ve decided to go with a 50/25/25 split, using `sample()` twice to achieve the necessary splits.
set.seed(11)
index <- sample(1:nrow(cleaning_3), nrow(cleaning_3)/2)
train <- cleaning_3[index,]
remainder <- cleaning_3[-index,]
set.seed(41)
remainder_index <- sample(1:nrow(remainder), nrow(remainder)/2)
val <- remainder[remainder_index,]
test <- remainder[-remainder_index,]
There may still be a need for cleaning variables, but it might be best to look at this after addressing some of the basic questions.
The number of passengers per trip is a discrete value and has a mean of 1.72. As this is count data with a zero lower bound and a mean close to 1, we would expect to see something like a Poisson distribution: highly right-skewed, with a mode of 1 tailing off towards the maximum value of 6. This is precisely what we see in the below plot.
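To make that expectation concrete, here is the Poisson probability mass at the observed mean, renormalised over the observed support of 1 to 6 passengers (a quick sketch; the real check would overlay these probabilities on the observed frequencies):

```r
# Poisson pmf at the observed mean of 1.72, renormalised over 1..6 passengers
lambda <- 1.72
p <- dpois(1:6, lambda)
round(p / sum(p), 3)  # probabilities fall away monotonically from a mode of 1
```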
The below plot shows the frequency associated with each payment type and clearly indicates a bimodal structure within the categorical variable. Paying by cash or card represents the two main options, with a comparatively small number of disputed, not charged or unknown payments.
Interestingly, I will show below that tips are only really recorded when the passenger pays with card. Cash tips are obviously just pocketed by the driver.
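A quick way to quantify this is to compare the share of tipped trips by payment type. This is only a sketch against the training set, and I'm assuming the raw `payment_type` codes here (card and cash trips carry codes like "CRD" and "CSH" in the 2013 TLC data):

```r
library(dplyr)

# Share of trips with a recorded tip, by payment type (sketch on `train`)
train %>%
  group_by(payment_type) %>%
  summarise(trips = n(),
            pct_tipped = mean(tip_amount > 0),
            mean_tip = mean(tip_amount))
```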
Fare amount is a right-skewed distribution with a median of 9.5 and a mean of 12.22. The most expensive fares are over $400 but most fares are below $50. Strangely, there is a large spike at exactly $52, which may suggest situations where the driver does not go by the meter and instead charges a fixed cost for a certain route.
Stealing a bit of thunder from my future investigations, I am able to classify where these routes are and confirm that 65,176 are picked up from JFK Airport and 44,337 are dropped off at JFK Airport. As shown below, these passengers are mainly being taken to and from Manhattan.
pickup_JFK <- train_2 %>%
  select(fare_amount, pickup_zone, dropoff_borough) %>%
  filter(fare_amount == 52) %>%
  filter(pickup_zone == "JFK Airport") %>%
  count(dropoff_borough) %>%
  arrange(desc(n))

pickup_JFK
## # A tibble: 5 x 2
## dropoff_borough n
## <chr> <int>
## 1 Manhattan 63747
## 2 Queens 808
## 3 Brooklyn 450
## 4 Bronx 87
## 5 <NA> 84
dropoff_JFK <- train_2 %>%
  select(fare_amount, dropoff_zone, pickup_borough) %>%
  filter(fare_amount == 52) %>%
  filter(dropoff_zone == "JFK Airport") %>%
  count(pickup_borough) %>%
  arrange(desc(n))

dropoff_JFK
## # A tibble: 6 x 2
## pickup_borough n
## <chr> <int>
## 1 Manhattan 43209
## 2 Queens 891
## 3 Brooklyn 167
## 4 <NA> 55
## 5 Bronx 14
## 6 Staten Island 1
The below plot shows the distribution of tip amount as predominantly 0 and then tailing off, despite some large outlier tip values.
This plot isn’t very clear, so I have zoomed in on the data to show tips up to $20 and colour coded this based on the payment method. It is clear from this plot that drivers are more than likely also receiving cash tips but just not recording them.
The distribution of tip amount by card, where we can more reliably consider the tip amount, resembles a Poisson-like distribution with a mean of 2.48.
The distribution of total amount looks much like that of fare amount, that is a heavily right-skewed distribution, although the spike at $52 has been washed out somewhat by tips and taxes.
As one might expect, the busiest hours for taxis are in the evening: between 6-11pm there is a noticeable plateau compared with the rest of the day. It’s also of interest that fares plummet between midnight and 5am but recover within 2 hours for the morning rush.
busy_hours <- train %>%
  group_by(pickup_hour) %>%
  summarise(count = n()) %>%
  mutate(rank = rank(desc(count))) %>%
  mutate(top_5 = case_when(rank <= 5 ~ "Yes",
                           TRUE ~ "No"))
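A minimal plot of this summary (assuming ggplot2 is loaded; `busy_hours` comes from the chunk above) might look like:

```r
library(ggplot2)

# One point per hour of the month, highlighting the five busiest hours
ggplot(busy_hours, aes(x = pickup_hour, y = count, colour = top_5)) +
  geom_point() +
  labs(x = "Pickup hour", y = "Trips", colour = "Top 5 hour")
```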
It would be interesting to see how working this period of time performs, as one might expect the market to be saturated with relatively short, low-yield trips as people travel to meet engagements.
As touched on earlier, with this question in mind I sourced some additional spatial data to help make sense of the location coordinates provided in the data set. The below plot shows why I felt this was necessary: each coordinate is relatively distinct, which makes it very challenging to group the data.
I could have rounded the coordinates to generate a grid system, and this is what I would have attempted had I not been able to locate a shapefile which served a similar purpose while providing more meaning to the data.
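For what it's worth, that grid fallback could be sketched by rounding each coordinate to three decimal places (roughly 100 metres at New York's latitude) and grouping on the rounded pair; the cell size here is my assumption:

```r
library(dplyr)

# Hypothetical grid fallback: round coordinates into ~100 m cells and count pickups
train %>%
  mutate(lon_cell = round(pickup_longitude, 3),
         lat_cell = round(pickup_latitude, 3)) %>%
  count(lon_cell, lat_cell, sort = TRUE)
```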
Fortunately I found a great shapefile of taxi zones on the Taxi and Limousine Commission site, and I was able to join the data based on the simple features themselves - see the investigations file for how this was done.
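The core of that join can be sketched with sf. I'm assuming the shapefile has been read into `taxi_zones` via `st_read()` and carries a `zone` column, as the later outputs suggest:

```r
library(sf)
library(dplyr)

# Turn pickups into point geometries and join each to its containing taxi zone
pickups_sf <- train %>%
  st_as_sf(coords = c("pickup_longitude", "pickup_latitude"),
           crs = 4326, remove = FALSE) %>%
  st_transform(st_crs(taxi_zones))

train_zoned <- st_join(pickups_sf,
                       taxi_zones %>% select(pickup_zone = zone))
```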
busy_pickup_locations <- train_2 %>%
  group_by(pickup_zone) %>%
  summarise(pickup = n())

busy_dropoff_locations <- train_2 %>%
  group_by(dropoff_zone) %>%
  summarise(dropoff = n())

taxi_zones_to_borough <- taxi_zone_busy %>%
  select(zone, borough) %>%
  st_drop_geometry()

busy_overall <- busy_pickup_locations %>%
  left_join(busy_dropoff_locations, by = c("pickup_zone" = "dropoff_zone")) %>%
  mutate(overall = pickup + dropoff) %>%
  left_join(taxi_zones_to_borough, by = c("pickup_zone" = "zone")) %>%
  select(zone = pickup_zone, borough, overall) %>%
  mutate(rank = rank(desc(overall))) %>%
  mutate(top_10 = case_when(rank <= 10 ~ "Yes",
                            TRUE ~ "No"))

busy_overall %>%
  filter(top_10 == "Yes") %>%
  arrange(desc(overall)) %>%
  select(zone, borough, trips = overall)
## # A tibble: 10 x 3
## zone borough trips
## <chr> <chr> <int>
## 1 Midtown Center Manhattan 560457
## 2 Upper East Side South Manhattan 530373
## 3 Upper East Side North Manhattan 507909
## 4 Murray Hill Manhattan 494400
## 5 Midtown East Manhattan 487248
## 6 Times Sq/Theatre District Manhattan 481249
## 7 Union Sq Manhattan 474616
## 8 East Village Manhattan 432718
## 9 Clinton East Manhattan 420493
## 10 Penn Station/Madison Sq West Manhattan 405917
The above table shows that the busiest locations, as measured by combined pickups and dropoffs per taxi zone, are all in Manhattan.
Visualising the data on a map clearly shows Manhattan’s prominence but also indicates a large number of trips to the airports at La Guardia and JFK.
Originally I wanted to classify trips by a combination of zone and hour. The reason I wanted to do this is that travel time could be expected to vary greatly based on traffic (which was probably the actual intent of the question). But when I did this, JFK to JFK topped the list at practically every hour. This is probably fair, since some trips would have been a few hundred metres down the road while others would surely have been half-way across the city and back to retrieve important items. So instead I am looking independently of time, at trips with at least 20 observations.
high_sd_time <- train_2 %>%
  group_by(pickup_zone, dropoff_zone) %>%
  summarise(st_dev_time = sd(trip_time_in_secs),
            average_distance = mean(trip_distance),
            n = n()) %>%
  filter(!is.na(pickup_zone)) %>%
  filter(!is.na(dropoff_zone)) %>%
  filter(n > 20) %>%
  arrange(desc(st_dev_time)) %>%
  ungroup() %>%
  head(10)
high_sd_time
## # A tibble: 10 x 5
## pickup_zone dropoff_zone st_dev_time average_distance n
## <chr> <chr> <dbl> <dbl> <int>
## 1 Maspeth JFK Airport 1944. 13.6 23
## 2 South Jamaica Midtown South 1756. 12.0 21
## 3 JFK Airport JFK Airport 1468. 4.98 2051
## 4 Flushing Meadows-Coro… Times Sq/Theatre D… 1468. 10.5 38
## 5 Garment District Flushing 1242. 11.1 23
## 6 Flushing Meadows-Coro… JFK Airport 1217. 10.7 47
## 7 LaGuardia Airport Brighton Beach 1214. 21.3 38
## 8 Penn Station/Madison … South Ozone Park 1175. 15.2 33
## 9 JFK Airport West Chelsea/Hudso… 1168. 17.1 544
## 10 JFK Airport Newark Airport 1167. 31.6 70
Going from Maspeth to JFK tops the lot with a standard deviation of over 32 minutes. These trips are on average 13.6 miles, so this seems reasonable. The below map shows these two zones to give you an idea of the distances involved.
Taking a similar approach to the previous question shows two short journeys, from Cobble Hill to Columbia Street in Brooklyn and from Seaport to Battery Park in Manhattan, as having the most consistent overall fare. Across the top 10 the standard deviation shows that the majority of fares are within $1.70 of the average total fare.
Again, this makes sense as they are predominantly short trips, mostly below 2 miles.
low_sd_fare <- train_2 %>%
  group_by(pickup_zone, dropoff_zone) %>%
  summarise(st_dev_fare = sd(total_amount),
            average_distance = mean(trip_distance),
            n = n()) %>%
  filter(!is.na(pickup_zone)) %>%
  filter(!is.na(dropoff_zone)) %>%
  filter(n > 20) %>%
  arrange(st_dev_fare) %>%
  ungroup() %>%
  head(10)
low_sd_fare
The mean can be used as a measure of central tendency when the data is symmetrical, i.e. not exhibiting signs of skewness. As established above, fare is heavily right-skewed when viewing the data as a whole, but when looking at particular trips you could expect to see distributions which are closer to normal.
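A tiny example of the effect, with one flat-rate style outlier among otherwise short trips:

```r
fares <- c(6.5, 7.0, 7.5, 8.0, 8.5, 52)  # five short trips plus one outlier
mean(fares)    # ~14.92, dragged up by the outlier
median(fares)  # 7.75, unaffected
```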
We can use statistical tests to understand how a sample distribution compares to a normal distribution, one example being the Shapiro-Wilk test. A statistic close to 1 suggests a normal distribution, but it is best to view the histogram or Q-Q plot to confirm.
Just a quick aside: when applying the Shapiro-Wilk test in R the sample size is limited to between 3 and 5000. This is primarily because the null hypothesis is that the distribution is normal (which is odd for a hypothesis test, as usually it would be the other way around) and very small samples tend to confirm the null while very large samples tend to reject it.
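For instance, on simulated samples the statistic behaves as described:

```r
set.seed(1)
shapiro.test(rnorm(500))$statistic  # close to 1 for normal data
shapiro.test(rexp(500))$statistic   # well below 1 for heavily skewed data
```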
Let’s take a look at the fare over trips:
## # A tibble: 10 x 4
## trip statistic p.value method
## <chr> <dbl> <dbl> <chr>
## 1 Midtown North to Seaport 0.990 0.0895 Shapiro-Wilk normali…
## 2 JFK Airport to Queens Village 0.995 0.613 Shapiro-Wilk normali…
## 3 Central Park to Lower East Side 0.993 0.354 Shapiro-Wilk normali…
## 4 World Trade Center to Gramercy 0.992 0.0275 Shapiro-Wilk normali…
## 5 Central Park to Stuy Town/Peter Coop… 0.992 0.603 Shapiro-Wilk normali…
## 6 Central Park to Two Bridges/Seward P… 0.994 0.952 Shapiro-Wilk normali…
## 7 Times Sq/Theatre District to East El… 0.990 0.236 Shapiro-Wilk normali…
## 8 Central Park to Washington Heights N… 0.993 0.685 Shapiro-Wilk normali…
## 9 SoHo to Stuyvesant Heights 0.991 0.775 Shapiro-Wilk normali…
## 10 Astoria to Flushing 0.991 0.956 Shapiro-Wilk normali…
The above table shows the top 10 trips based on the Shapiro-Wilk test. Examining one of these examples, we would expect to see data which is normally distributed.
shap_fare_2 <- shap_fare %>%
  filter(!str_detect(trip, "JFK")) %>%
  top_n(-10, statistic)
shap_fare_2
## # A tibble: 10 x 4
## trip statistic p.value method
## <chr> <dbl> <dbl> <chr>
## 1 Brooklyn Heights to Cobble Hill 0.213 8.47e-32 Shapiro-Wilk norma…
## 2 Midtown East to UN/Turtle Bay South 0.117 2.15e-87 Shapiro-Wilk norma…
## 3 Long Island City/Queens Plaza to Old … 0.102 1.57e-32 Shapiro-Wilk norma…
## 4 Baisley Park to Times Sq/Theatre Dist… 0.121 1.30e-19 Shapiro-Wilk norma…
## 5 East Harlem South to Manhattanville 0.198 2.32e-30 Shapiro-Wilk norma…
## 6 South Williamsburg to Williamsburg (S… 0.163 1.27e-19 Shapiro-Wilk norma…
## 7 Baisley Park to Midtown North 0.137 1.21e-14 Shapiro-Wilk norma…
## 8 Baisley Park to Murray Hill 0.129 2.44e-15 Shapiro-Wilk norma…
## 9 Baisley Park to Midtown East 0.113 6.16e-17 Shapiro-Wilk norma…
## 10 Springfield Gardens South to Times Sq… 0.167 1.42e-13 Shapiro-Wilk norma…
On the other hand, this table shows some low scoring trips (I’ve excluded trips involving JFK as the already established $52 fares create some very strange distributions), and we would expect to see data that is not normally distributed. These would be cases where you would probably not want to use the mean as a measure of central tendency. Such as the below, where the mean value would be dragged away from the mode and median by a few significant outliers.
The same can be done for trip time:
## # A tibble: 10 x 4
## trip statistic p.value method
## <chr> <dbl> <dbl> <chr>
## 1 Financial District North to Centr… 0.992 2.40e-1 Shapiro-Wilk normali…
## 2 Bloomingdale to Union Sq 0.996 9.07e-1 Shapiro-Wilk normali…
## 3 Upper East Side South to SoHo 0.991 1.32e-7 Shapiro-Wilk normali…
## 4 Upper East Side North to SoHo 0.991 7.06e-4 Shapiro-Wilk normali…
## 5 Lincoln Square East to Little Ita… 0.992 1.12e-4 Shapiro-Wilk normali…
## 6 Bloomingdale to Greenwich Village… 0.992 8.69e-1 Shapiro-Wilk normali…
## 7 Upper West Side South to Little I… 0.993 5.03e-2 Shapiro-Wilk normali…
## 8 Bloomingdale to Murray Hill 0.992 3.37e-1 Shapiro-Wilk normali…
## 9 Manhattanville to Lincoln Square … 0.993 9.54e-1 Shapiro-Wilk normali…
## 10 Murray Hill to Bedford 0.992 4.14e-1 Shapiro-Wilk normali…
And we would expect a similar result:
Thinking logically, we should be able to predict fares with reasonable accuracy. My belief stems from the prescribed nature of fares, where we could expect variables like distance travelled, length of trip, number of passengers, time of day, tolls etc. to be taken into account in determining a fare. Whether we can accurately predict the fare from only location, time of day and day of week is less clear (I am assuming the question meant to say day of week, as this would seem to make more sense).
In terms of tip amount, we have already established that the data only captures tip amounts when they are paid by card. Even if I were to isolate for this, I wouldn’t expect a model to be very effective in estimating tip amounts as it depends on many more human variables which are very challenging to measure. How generous is the passenger? How friendly was the driver? How urgent was the trip? All variables that are going to be very challenging to measure.
Below I have demonstrated how a random forest using only 5,000 observations from the training set can do an okay job at predicting fare amount. To aid the algorithm I have simplified time of day into 5 periods.
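The fitting step can be sketched as below. The actual chunk is hidden, so the randomForest package and the exact five-period cut of the hour are my assumptions here:

```r
library(dplyr)
library(lubridate)
library(randomForest)

set.seed(7)
# Assumed five-period simplification of pickup hour, then a 5,000-row sample
rf_sample <- train_2 %>%
  mutate(pickup_time = cut(hour(pickup_datetime),
                           breaks = c(-1, 5, 9, 15, 19, 23),
                           labels = c("night", "am_peak", "midday",
                                      "pm_peak", "evening"))) %>%
  select(pickup_longitude, pickup_latitude, dropoff_longitude,
         dropoff_latitude, day_of_week, pickup_time, fare_amount) %>%
  sample_n(5000)

rf_fit <- randomForest(fare_amount ~ ., data = rf_sample)
sqrt(tail(rf_fit$mse, 1))  # out-of-bag RMSE estimate
```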
## # A tibble: 6 x 8
## pickup_longitude pickup_latitude dropoff_longitu… dropoff_latitude day_of_week
## <dbl> <dbl> <dbl> <dbl> <fct>
## 1 -74.0 40.8 -73.9 40.8 Mon
## 2 -74.0 40.7 -74.0 40.8 Mon
## 3 -74.0 40.8 -74.0 40.8 Sat
## 4 -74.0 40.7 -74.0 40.7 Wed
## 5 -74.0 40.8 -74.0 40.8 Sun
## 6 -74.0 40.8 -74.0 40.7 Mon
## # … with 3 more variables: pickup_time <fct>, fare_amount <dbl>, pred <dbl>
## # A tibble: 1 x 1
## rmse
## <dbl>
## 1 4.14
This model achieves a root mean squared error (RMSE) of 4.1429636. This suggests that on average the predicted values are about 4.14 dollars away from the actual value of fare amount. The below plot of actual vs predicted values shows that this model performs relatively well.
Taking aim at tip amount, the below model shows that RMSE alone may not tell the whole story.
This model achieves an RMSE of 1.8154988. This suggests that on average the predicted values are about 1.82 dollars away from the actual value of tip amount. However, the below plot of actual vs predicted values shows that the model tends to underestimate tip amount. I’d suspect that this is on account of the 0 values coming from cash transactions.
Another approach might be to look at building a number of smaller models to help understand which parts of the data set are explained well by a model. For example, we can build a different model for each hour of the day and day of the week and see how well the pickup and dropoff coordinates do at explaining fare amount.
nested_train <- train_2 %>%
  select(fare_amount, pickup_lon, pickup_lat, dropoff_lon, dropoff_lat, pickup_hour, day_of_week) %>%
  group_by(pickup_hour, day_of_week) %>%
  nest()
nested_train_coord <- nested_train %>%
  mutate(model = map(data, ~lm(fare_amount ~ pickup_lon + pickup_lat + dropoff_lon + dropoff_lat, data = .x))) %>%
  mutate(glance = map(model, glance)) %>%
  unnest(glance) %>%
  select(-c(data, model))
As shown below, pickup and dropoff locations as measured in coordinates do a fairly poor job of explaining the variation in fare amount. They are particularly poor in the early hours of the morning.
On the other hand, we could replace the coordinates with the taxi zones I added for the earlier question and see whether that makes a difference.
nested_train_1 <- train_2 %>%
  select(fare_amount, pickup_zone, dropoff_zone, pickup_hour, day_of_week) %>%
  group_by(pickup_hour, day_of_week) %>%
  nest()

nested_train_zone <- nested_train_1 %>%
  mutate(model = map(data, ~lm(fare_amount ~ pickup_zone + dropoff_zone, data = .x)),
         glance = map(model, glance)) %>%
  unnest(glance) %>%
  select(-c(data, model))
Here we see a remarkably different story, with nearly all models explaining above 60% of the variation in fare amount. The very interesting point is that 5am, which was a genuine challenge to model with coordinates, is the standout performer within the hour/day models. This is a great example of where generating a categorical variable from a continuous variable can help the model better interpret the underlying patterns.
I am going to approach the next two questions in different ways - the first requires identifying top performing drivers aka hacks and then looking specifically at their data to understand what they do differently. The second is to look purely at the things which I can focus on, time and location.
After grouping the full data set by hack license (not by medallion, as it appears some drivers work across several cabs) I am able to compute some summary statistics, such as:
I then want to filter out outliers, which is mainly achieved with some common-sense rules: drivers must have worked at least 10 days, at least 4 hours per day, done over 100 trips and travelled at least 50 miles.
The objective here is to maximise earnings in a day, so I think a good place to start is who is maximising earnings per hour after petrol is accounted for.
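A condensed sketch of that per-driver summary (covering a few of the columns in the output below; the petrol assumption of 28 mpg at $3.32 per gallon follows the company-level chunk further down):

```r
library(dplyr)
library(lubridate)

# Per-driver totals and an hourly earnings rate net of estimated petrol costs
driver_stats <- all_clean_data %>%
  group_by(hack_license) %>%
  summarise(total_trips = n(),
            total_hours_worked = sum(trip_time_in_secs) / 3600,
            total_miles_travelled = sum(trip_distance),
            total_take = sum(total_amount),
            total_days_working = n_distinct(date(pickup_datetime))) %>%
  mutate(total_paid_for_petrol = total_miles_travelled / 28 * 3.32,
         avg_hourly_earnings_after_petrol =
           (total_take - total_paid_for_petrol) / total_hours_worked) %>%
  arrange(desc(avg_hourly_earnings_after_petrol))
```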
## # A tibble: 10 x 18
## hack_license total_trips total_hours_wor… total_miles_tra… total_fare
## <chr> <int> <dbl> <dbl> <dbl>
## 1 1E94B13BB69… 587 136. 3378. 12083.
## 2 097CA0DEB62… 697 114. 1811. 10242.
## 3 C86A902115E… 151 90.8 2623. 7641
## 4 C4B3B913C14… 683 120. 2715. 9426.
## 5 BB415410427… 321 128. 2854. 10330.
## 6 C2BD6ED6EE8… 295 119. 2898. 9478.
## 7 CBC0F1322B8… 561 112. 2453. 8523.
## 8 B06FE0F38ED… 290 84.0 1822. 6560.
## 9 4FBAD760692… 870 133. 2684. 10074
## 10 D31F445F966… 462 131. 3229. 10310.
## # … with 13 more variables: total_tips <dbl>, total_take <dbl>,
## # total_paid_for_petrol <dbl>, total_earnings <dbl>,
## # total_days_working <int>, total_mediallians_used <int>,
## # avg_hours_per_day <dbl>, avg_earnings_per_day <dbl>,
## # avg_earnings_per_trip <dbl>, avg_earnings_per_hour <dbl>,
## # avg_earnings_per_mile <dbl>, avg_earnings_per_hour_and_per_mile <dbl>,
## # avg_hourly_earnings_after_petrol <dbl>
After doing that I have the top 10 drivers, who are all earning over $90 per hour. Take a bow 1E94B13BB698BC3C98178429C45FDEED.
So what is this driver doing that is so amazing? Well they are primarily waiting at La Guardia to take people into Manhattan.
What about the other guys in the top 10, are they all at La Guardia? No, in fact as you can see below the preferred lurking spot is actually at JFK.
What about when it comes to hours? Well the top guy does exclusively late nights and the top 10 seem to prefer that as well.
Regardless, what these guys are doing is definitely working as they are well above the pack. Based on this, to maximise my earnings I would base my pickup strategy entirely around working out of La Guardia or JFK and aim to work late evenings to early mornings.
Post-work note: I’ve noticed two key assumptions that will most likely void the above. I’ve assumed hours worked to be based on the sum of trip times, which excludes the important time spent looking for a fare. An improved analysis would require looking at the sequence of dropoff and pickup times to understand periods of work.
For this question I am taking an entirely different approach, not looking at drivers but instead looking at the things a driver can influence, where they look for fares, on what days and at what times.
The median monthly earnings for a driver in this data is $6755. Looking at the various pickup locations, days and times, a driver in the top quartile (assuming I could find it) would need an average of 76 hours to hit this median monthly earnings. Assuming I like the idea of weekends off, this works out to a bit over 3 hours per day.
So ignoring locations for now, what day and hours should I be working? Basically I want to be starting after midnight and packing in before 6am.
So with that in mind, I’ve generated a map which could sit in my dashboard window telling me where I should be:
For this question I will try to find the areas where the taxis are consistently performing above expectation. My approach is to look at medallions rather than drivers and rank them into deciles by total earnings after petrol. From here, I can extract the taxi zones with the highest average decile score and identify the following:
company_plan <- all_clean_data %>%
  group_by(medallion) %>%
  mutate(total_miles_travelled = sum(trip_distance),
         total_take = sum(total_amount),
         total_paid_for_petrol = total_miles_travelled / 28 * 3.32,
         total_earnings = total_take - total_paid_for_petrol) %>%
  ungroup() %>%
  mutate(earnings_decile = ntile(total_earnings, 10)) %>%
  group_by(pickup_borough, pickup_zone, day_of_week, pickup_hour) %>%
  summarise(n = n(),
            competing_taxi = n_distinct(medallion),
            average_earnings_decile = mean(earnings_decile))

company_plan_1 <- company_plan %>%
  filter(n > 1000)

company_plan_2 <- company_plan_1 %>%
  group_by(pickup_zone, pickup_borough) %>%
  summarise(average_earnings_decile = mean(average_earnings_decile)) %>%
  arrange(desc(average_earnings_decile)) %>%
  head(10)
company_plan_2
## # A tibble: 10 x 3
## # Groups: pickup_zone [10]
## pickup_zone pickup_borough average_earnings_decile
## <chr> <chr> <dbl>
## 1 Clinton East Manhattan 5.68
## 2 Penn Station/Madison Sq West Manhattan 5.66
## 3 Meatpacking/West Village West Manhattan 5.65
## 4 East Harlem South Manhattan 5.64
## 5 Lower East Side Manhattan 5.64
## 6 West Village Manhattan 5.64
## 7 Williamsburg (North Side) Brooklyn 5.63
## 8 East Chelsea Manhattan 5.62
## 9 Lenox Hill West Manhattan 5.61
## 10 Williamsburg (South Side) Brooklyn 5.61
With this information in hand, I would have my taxis run at a very high rate, using multiple drivers, as I have noticed that the value of fares outweighs the cost in petrol (and most likely in car replacement). The below map suggests working mid to lower Manhattan and Brooklyn, and that if these places were focused on, earnings would likely be above average.